Consistent Weighted Sampling Made Fast, Small, and Easy
Authors
Abstract
Document sketching using Jaccard similarity has been an effective technique for reducing near-duplicates in Web page and image search results, and has also proven useful in file system synchronization, compression, and learning applications [6, 4, 5]. Min-wise sampling can be used to derive an unbiased estimator for Jaccard similarity, and taking a few hundred independent consistent samples leads to compact sketches that provide good estimates of pairwise similarity. Early sketching papers handled weighted similarity, for integer weights, by transforming an element of weight w into w elements of unit weight, each requiring its own hash function evaluation in the consistent sampling. Subsequent work [12, 19, 14] removed the integer-weight restriction and showed how to produce samples using a constant number of hash evaluations for any element, independent of its weight. Another drastic speedup for sketch computations was given by Li, Owen and Zhang [17], who showed how to compute such (near-)independent samples in one shot, requiring only a constant number of hash function evaluations per element. Unfortunately, this latter improvement works only for the unweighted case. In this paper we give a simple, fast, and accurate procedure that reduces weighted sets to unweighted sets with small impact on the Jaccard similarity. This leads to compact sketches consisting of many (near-)independent weighted samples, which can be computed with just a small constant number of hash function evaluations per weighted element. Furthermore, the size of the produced unweighted set is a tunable parameter, which enables us to run the unweighted scheme from [17] in the regime where it is most efficient. Even when the sets involved are unweighted, our approach gives a simple solution to the densification problem that [20, 21] attempt to address. Unlike previously known schemes, ours does not result in an unbiased estimator. However, we prove that the bias introduced by our reduction is negligible and that the standard deviation is comparable to the unweighted case. We also empirically evaluate our scheme and show that it gives significant gains in computational efficiency, without any measurable loss in accuracy.
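As a rough illustration of the baseline the abstract describes (not the reduction proposed in the paper), the following Python sketch expands integer-weighted sets into unit-weight elements and estimates their weighted Jaccard similarity by min-wise sampling. All function names, the seeded BLAKE2b hash, and the sample count of 256 are assumptions made for this example. Note that the cost is one hash evaluation per unit of weight per sample, which is exactly the overhead the later schemes avoid.

```python
import hashlib


def weighted_jaccard(a, b):
    """Exact weighted Jaccard: sum of min weights over sum of max weights."""
    keys = set(a) | set(b)
    num = sum(min(a.get(k, 0), b.get(k, 0)) for k in keys)
    den = sum(max(a.get(k, 0), b.get(k, 0)) for k in keys)
    return num / den if den else 1.0


def expand(weighted):
    """Turn an element of integer weight w into w distinct unit-weight elements."""
    return {(elem, i) for elem, w in weighted.items() for i in range(w)}


def _hash(seed, item):
    # Hypothetical hash family: one seeded 64-bit BLAKE2b digest per sample index.
    data = f"{seed}|{item!r}".encode("utf-8")
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(items, num_samples=256):
    """For each seed, keep the element with the smallest hash (items must be non-empty)."""
    return [min(items, key=lambda x: _hash(seed, x)) for seed in range(num_samples)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of sample positions where both sketches chose the same element."""
    matches = sum(x == y for x, y in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


a = {"x": 3, "y": 1}
b = {"x": 1, "y": 1, "z": 2}
exact = weighted_jaccard(a, b)
approx = estimate_similarity(minhash_sketch(expand(a)), minhash_sketch(expand(b)))
print(f"exact={exact:.3f} estimate={approx:.3f}")
```

Because both sets are sampled with the same seeded hash functions, the samples are consistent: two sketches agree at a position exactly when the minimum-hash element of the union lies in the intersection, which happens with probability equal to the Jaccard similarity of the expanded sets.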
Similar resources
Conformal Mapping with as Uniform as Possible Conformal Factor
According to the Uniformization Theorem, any surface can be conformally mapped into a domain of a constant Gaussian curvature. The conformal factor indicates the local scaling introduced by such a mapping. This process could be used to compute geometric quantities in a simplified flat domain with zero Gaussian curvature. For example, the computation of geodesic distances on a curved surface can...
Median Estimation in Sample Surveys
In a recent paper Maritz and Jarrett (1978) proposed a small-sample estimate of the variance of sample medians from continuous populations. In this paper their methods are adapted to median estimation in stratified sampling without replacement from finite populations. A weighted sample median for estimating the median of heavy-tailed or skewed populations is proposed. Its asymptotic normal distri...
Efficient k-space sampling by density-weighted phase-encoding.
Acquisition-weighting improves the localization of MRI experiments. An approach to acquisition-weighting in a purely phase-encoded experiment is presented that is based on a variation of the sampling density in k-space. In contrast to conventional imaging or to accumulation-weighting, where k-space is sampled with uniform increments, density-weighting varies the distance between neighboring sam...
Using retrospective sampling to estimate models of relationship status in large longitudinal social networks
Estimation of longitudinal models of relationship status between all pairs of individuals (dyads) in social networks is challenging due to the complex inter-dependencies among observations and lengthy computation times. To reduce the computational burden of model estimation, a method is developed that subsamples the "always-null" dyads in which no relationships develop throughout the period of ...
Improved Diagnostic Utility of T2-weighted 3D-TSE Liver Imaging by Suppression of Vascular Signals using a Motion-Sensitive Preparation
Introduction: Single-slab, three-dimensional turbo/fast spin-echo (3D-TSE) pulse sequences with variable-flip-angle refocusing RF pulses [1,2] (e.g., SPACE [Siemens] or 3D FSE CUBE [GE]), when combined with navigator-based respiratory triggering, provide a sufficiently high sampling efficiency to permit high-resolution, T2-weighted 3D imaging of the liver in a clinically-reasonable acquisition ...
Journal: CoRR
Volume: abs/1410.4266
Publication date: 2014